Project Foundations for Data Science: FoodHub Data Analysis¶

Marks: 40

Context¶

The number of restaurants in New York is increasing day by day. Lots of students and busy professionals rely on those restaurants due to their hectic lifestyles. Online food delivery service is a great option for them. It provides them with good food from their favorite restaurants. A food aggregator company FoodHub offers access to multiple restaurants through a single smartphone app.

The app allows the restaurants to receive a direct online order from a customer. The app assigns a delivery person from the company to pick up the order after it is confirmed by the restaurant. The delivery person then uses the map to reach the restaurant and waits for the food package. Once the food package is handed over to the delivery person, he/she confirms the pick-up in the app and travels to the customer's location to deliver the food. The delivery person confirms the drop-off in the app after delivering the food package to the customer. The customer can rate the order in the app. The food aggregator earns money by collecting a fixed margin of the delivery order from the restaurants.

Objective¶

The food aggregator company has stored the data of the different orders made by the registered customers in their online portal. They want to analyze the data to get a fair idea about the demand of different restaurants which will help them in enhancing their customer experience. Suppose you are hired as a Data Scientist in this company and the Data Science team has shared some of the key questions that need to be answered. Perform the data analysis to find answers to these questions that will help the company to improve the business.

Data Description¶

The data contains the different data related to a food order. The detailed data dictionary is given below.

Data Dictionary¶

  • order_id: Unique ID of the order
  • customer_id: ID of the customer who ordered the food
  • restaurant_name: Name of the restaurant
  • cuisine_type: Cuisine ordered by the customer
  • cost: Cost of the order
  • day_of_the_week: Indicates whether the order is placed on a weekday or weekend (The weekday is from Monday to Friday and the weekend is Saturday and Sunday)
  • rating: Rating given by the customer out of 5
  • food_preparation_time: Time (in minutes) taken by the restaurant to prepare the food. This is calculated by taking the difference between the timestamps of the restaurant's order confirmation and the delivery person's pick-up confirmation.
  • delivery_time: Time (in minutes) taken by the delivery person to deliver the food package. This is calculated by taking the difference between the timestamps of the delivery person's pick-up confirmation and drop-off information

Let us start by importing the required libraries¶

In [2]:
# import libraries for data manipulation
import numpy as np
import pandas as pd

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# to restrict the float value to 3 decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)

Understanding the structure of the data¶

In [3]:
# uncomment and run the following lines for Google Colab
# from google.colab import drive
# drive.mount('/content/drive')
In [3]:
# read the data
df = pd.read_csv('foodhub_order.csv')
# returns the first 5 rows
df.head()
Out[3]:
order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time
0 1477147 337525 Hangawi Korean 30.750 Weekend Not given 25 20
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.080 Weekend Not given 25 23
2 1477070 66393 Cafe Habana Mexican 12.230 Weekday 5 23 28
3 1477334 106968 Blue Ribbon Fried Chicken American 29.200 Weekend 3 25 15
4 1478249 76942 Dirty Bird to Go American 11.590 Weekday 4 25 24

Observations:¶

The DataFrame has 9 columns as mentioned in the Data Dictionary. Data in each row corresponds to the order placed by a customer.

Question 1: How many rows and columns are present in the data?¶

In [4]:
# Write your code here
df.shape
Out[4]:
(1898, 9)

Observations:¶

There are 1898 rows and 9 columns in the dataset.

Question 2: What are the datatypes of the different columns in the dataset? (The info() function can be used)¶

In [5]:
# Use info() to print a concise summary of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1898 non-null   object 
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
dtypes: float64(1), int64(4), object(4)
memory usage: 133.6+ KB

Observations:¶

  • There are 5 Numerical type columns (1 float & 4 int), and 4 object type columns
  • Rating column is object type although it can be converted to numerical column for better analysis
  • Although order_id and customer_id are integer type, but essentially they are categorical variables as we cannot do any numerical or statistical analysis on their values. Their averages or standard deviations have no significant meaning or value, as they are numbers given to identify a customer or an order.

Question 3: Are there any missing values in the data? If yes, treat them using an appropriate method¶

In [17]:
#lets see if there any null values in the data set
df.isnull().sum()
Out[17]:
order_id                 0
customer_id              0
restaurant_name          0
cuisine_type             0
cost_of_the_order        0
day_of_the_week          0
rating                   0
food_preparation_time    0
delivery_time            0
dtype: int64
In [7]:
#Now lets check why the rating column is object type
df['rating'].unique()
Out[7]:
array(['Not given', '5', '3', '4'], dtype=object)
In [12]:
#seems like there are customers who havent given any rating on their order, lets count their total number & their percentage
df.loc[df['rating']=='Not given','rating'].value_counts(), df.loc[df['rating']=='Not given','rating'].value_counts()*100/df.shape[0]
Out[12]:
(Not given    736
 Name: rating, dtype: int64,
 Not given    38.777661
 Name: rating, dtype: float64)

Observations:¶

  • We have no missing values in all columns except for the fact that rating column has 736 values as 'Not given' accounting for almost 39% of the values, which can be treated as missing value
  • These 'Not given' should actually be read as missing values (NaN). Let's replace such data points with null values.
In [7]:
#lets replace the data with nan values
df['rating']=df['rating'].replace('Not given',np.nan)

#lets conver the data to float type
df['rating']=df['rating'].astype(float)

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1898 entries, 0 to 1897
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               1898 non-null   int64  
 1   customer_id            1898 non-null   int64  
 2   restaurant_name        1898 non-null   object 
 3   cuisine_type           1898 non-null   object 
 4   cost_of_the_order      1898 non-null   float64
 5   day_of_the_week        1898 non-null   object 
 6   rating                 1162 non-null   float64
 7   food_preparation_time  1898 non-null   int64  
 8   delivery_time          1898 non-null   int64  
dtypes: float64(2), int64(4), object(3)
memory usage: 133.6+ KB
  • We have converted 'Not given' values to nan values and converted the column type to float for better analysis
  • Now we have 8 columns, with 7 columns have no missing values, only rating has missing values as we have observed before (736)

Question 4: Check the statistical summary of the data. What is the minimum, average, and maximum time it takes for food to be prepared once an order is placed?¶

In [26]:
# Write your code here
df.describe().T
Out[26]:
count mean std min 25% 50% 75% max
order_id 1898.000 1477495.500 548.050 1476547.000 1477021.250 1477495.500 1477969.750 1478444.000
customer_id 1898.000 171168.478 113698.140 1311.000 77787.750 128600.000 270525.000 405334.000
cost_of_the_order 1898.000 16.499 7.484 4.470 12.080 14.140 22.297 35.410
rating 1162.000 4.344 0.741 3.000 4.000 5.000 5.000 5.000
food_preparation_time 1898.000 27.372 4.632 20.000 23.000 27.000 31.000 35.000
delivery_time 1898.000 24.162 4.973 15.000 20.000 25.000 28.000 33.000

Observations:¶

  • Cost of Order: The average price is around 16.5, and of the prices fall within 14.14, which indicates majority of prices fall around these prices. 75% of the prices fall within 22.3, and few orders cost 35$.
  • Rating: Most of cutomers have rated 4 or 5, which means usually rating is done by satisfied customers as there is not a single value with either 1 or 2 rating.
  • food_preparation_time: Mean and median value for food preparation time is 27 which means most of the values are around 27 minutes. 75% of orders are prepared within 31 mins and maximum time is 35 mins, meaning there is little delay in few of the orders otherwise most of the orders are prepared within half hour. Same goes for minimum time, as few orders are prepared within 20 mins, while only 25% are prepared with 23 mins which is quite close to the mean.
  • delivery_time: 50% of orders are delivered within 25 mins, while 75% are delivered within 28mins. 25% are delivered within 20 mins, while max time for delivery is 33 mins.
  • Considering food preparation and delivery time stats we can say that customers mostly get their food within 1 hour after placing orders.

Question 5: How many orders are not rated?¶

In [46]:
# Write the code here
df['rating'].isnull().sum()
Out[46]:
736
In [47]:
df['rating'].isnull().sum()*100/df.shape[0]
Out[47]:
38.77766069546892

Observations:¶

*There are 736 orders which are not rated, which account for almost 39% of the orders. We have already converted them to NaN values in the previous segment.

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

Question 6: Explore all the variables and provide observations on their distributions. (Generally, histograms, boxplots, countplots, etc. are used for univariate exploration)¶

Analysis of Costs of the Orders

In [13]:
# Lets check first the distribution of cost_of_order
sns.histplot(data=df, x='cost_of_the_order',kde='True')
plt.show()
sns.boxplot(data=df, x='cost_of_the_order')
Out[13]:
<Axes: xlabel='cost_of_the_order'>
In [5]:
#lets create price bins for prices between 5-15,15-25 & 25-35 to see if there is any pattern
df['price_bin']=pd.cut(df['cost_of_the_order'],bins=[5,15,25,35],labels=["Low","Medium","High"])
sns.histplot(data=df, x='price_bin')
Out[5]:
<Axes: xlabel='price_bin', ylabel='Count'>
  • THe distribution is a little skewed twoards right
  • There are no outliers in the cost of order and most of the orders are below the mean value
  • The shape of density curve is like multimodal, and there is gap or little orders between these modes which indicate that there might be some food types or restaurants that have expensive items than usual orders. There seems to orders of 3 categories: below the mean (around 16), between 20-25 and between 30-35.
  • The orders in low price bin are highest in quantity

Observations on Ratings

In [20]:
sns.countplot(data=df, x='rating')
Out[20]:
<Axes: xlabel='rating', ylabel='count'>
  • Most of the ratings are 5, few are 3. It means that usually satisfied customers rate the orders.
  • As we have seen before, there were 736 orders that were not rated, which are actually greater than the number of orders that got maximum rating. It means that all the quite large number of orders are unrated.

Observations on Food Preparation and Delivery Time

In [32]:
plt.figure(figsize = (5, 3))
sns.countplot(data=df, x='food_preparation_time');
plt.show()
plt.figure(figsize = (4, 2))
sns.boxplot(data=df,x='food_preparation_time')
plt.show()
In [33]:
plt.figure(figsize = (5, 3))
sns.countplot(data=df, x='delivery_time');
plt.show()
plt.figure(figsize = (4, 2))
sns.boxplot(data=df,x='delivery_time')
plt.show()
In [53]:
df.loc[df['delivery_time']>30, 'delivery_time'].value_counts().sum()
Out[53]:
149

Observations

  • The food preparation time is evenly distributed between 20-35 minutes
  • Most of the orders are deliverd with 15-30 mins and few orders take more than 30 mins (149 orders)
  • There are no outliers for both food preparation and food delivery time.
In [61]:
df['restaurant_name'].nunique(),df['cuisine_type'].nunique()
Out[61]:
(178, 14)
In [64]:
sns.countplot(data=df, x='cuisine_type')
plt.xticks(rotation=90)
plt.show()
In [7]:
sns.countplot(data=df, x='price_bin', hue='day_of_the_week')
Out[7]:
<Axes: xlabel='price_bin', ylabel='count'>
In [78]:
df.loc[df['day_of_the_week']=='Weekend','day_of_the_week'].value_counts()/df.shape[0]
Out[78]:
Weekend   0.712
Name: day_of_the_week, dtype: float64

Observations on Week of Day & Cuisine Type

  • Most common orders are for American & Japanese Food; Italian and Chinese follow them in no of orders.
  • 71.2% of the orders are placed during the weekends.

Question 7: Which are the top 5 restaurants in terms of the number of orders received?¶

In [135]:
df['restaurant_name'].value_counts().head()
Out[135]:
Shake Shack                  219
The Meatball Shop            132
Blue Ribbon Sushi            119
Blue Ribbon Fried Chicken     96
Parm                          68
Name: restaurant_name, dtype: int64
In [136]:
df['restaurant_name'].value_counts().shape[0]/df.shape[0]
Out[136]:
0.09378292939936776

Observations:

  • The top five restaurants are Shake Shack, The Meatball shop, Blue Ribbon Sushi, Blue Ribbon Fried Chicken and Parm
  • Top five restaurants account for 9.4% of the total orders

Question 8: Which is the most popular cuisine on weekends?¶

In [118]:
df.loc[df['day_of_the_week']=='Weekend','cuisine_type'].value_counts().head()
Out[118]:
American    415
Japanese    335
Italian     207
Chinese     163
Mexican      53
Name: cuisine_type, dtype: int64
In [4]:
df.loc[df['day_of_the_week']=='Weekday','cuisine_type'].value_counts().head()
Out[4]:
American    169
Japanese    135
Italian      91
Chinese      52
Mexican      24
Name: cuisine_type, dtype: int64

Observations:¶

  • American cuisine is the most popular on weekends
  • As we can see from above cuisine type graphs, most popular cuisine types are the ones that are popular on weekdays as well as on weekends.

Question 9: What percentage of the orders cost more than 20 dollars?¶

In [137]:
df[df['cost_of_the_order']>20].shape[0]/df.shape[0]
Out[137]:
0.2924130663856691
In [139]:
df[df['cost_of_the_order']>20].shape[0]
Out[139]:
555

Observations:¶

  • 29.24% of the orders cost more than 20 dollars
  • 555 orders costed more than 20 dollars

Question 10: What is the mean order delivery time?¶

In [131]:
df['delivery_time'].mean()
Out[131]:
24.161749209694417
In [140]:
df['delivery_time'].mean()+df['food_preparation_time'].mean()
Out[140]:
51.53371970495259

Observations:¶

  • The mean delivery time is 24.162 mins
  • The mean time to get food after placement of order is 51.53 mins

Question 11: The company has decided to give 20% discount vouchers to the top 3 most frequent customers. Find the IDs of these customers and the number of orders they placed¶

In [30]:
df['customer_id'].value_counts().nlargest(10)
Out[30]:
52832     13
47440     10
83287      9
250494     8
259341     7
82041      7
65009      7
276192     7
97079      6
97991      6
Name: customer_id, dtype: int64

Observations:¶

  • The customer IDs of top 3 most frequent customers are 52832 (13 orders), 47440 (10 orders) and 83287 (9 orders).

Multivariate Analysis¶

Question 12: Perform a multivariate analysis to explore relationships between the important variables in the dataset. (It is a good idea to explore relations between numerical variables as well as relations between numerical and categorical variables)¶

In [14]:
df.columns
Out[14]:
Index(['order_id', 'customer_id', 'restaurant_name', 'cuisine_type',
       'cost_of_the_order', 'day_of_the_week', 'rating',
       'food_preparation_time', 'delivery_time'],
      dtype='object')
In [16]:
plt.figure(figsize=(8,4))
sns.heatmap(df[['cost_of_the_order','rating','delivery_time','food_preparation_time']].corr(),annot=True,vmin=-1,vmax=1);
plt.show()
  • There seems no significant correlation between the numerical variables. Maximum correlation among the variables is between cost of the order & food preparation time which is 4.2% which is not enough to make any conclusion.

Lets first check relationship of cost of orders to the categorical variables.

In [29]:
#lets observe the distribution of cost for each cuisine type
sns.boxplot(data=df,x='cuisine_type',y='cost_of_the_order')
plt.xticks(rotation=90)
plt.show()
In [67]:
#now lets check the cost of order on day of the week for each cuisine type
plt.figure(figsize=(3,1))
sns.relplot(data=df,x='day_of_the_week',y='cost_of_the_order',col='cuisine_type',kind='scatter',col_wrap=4)
plt.show()
<Figure size 300x100 with 0 Axes>

We can see that there is difference in number of orders in weekday and weekend for few of the cuisine types, lets explore it further

In [23]:
df.groupby(['cuisine_type', 'day_of_the_week']).agg({'cost_of_the_order': 'mean', 'order_id': 'nunique', 'rating': 'mean'})
Out[23]:
cost_of_the_order order_id rating
cuisine_type day_of_the_week
American Weekday 15.308 169 4.252
Weekend 16.732 415 4.319
Chinese Weekday 17.148 52 4.231
Weekend 16.036 163 4.364
French Weekday 20.866 5 4.500
Weekend 19.382 13 4.250
Indian Weekday 17.162 24 4.429
Weekend 16.801 49 4.583
Italian Weekday 16.890 91 4.403
Weekend 16.211 207 4.336
Japanese Weekday 16.368 135 4.365
Weekend 16.279 335 4.377
Korean Weekday 12.180 2 4.500
Weekend 14.333 11 4.000
Mediterranean Weekday 14.040 14 4.000
Weekend 16.102 32 4.333
Mexican Weekday 16.556 24 4.375
Weekend 17.104 53 4.438
Middle Eastern Weekday 19.833 17 4.467
Weekend 18.283 32 4.053
Southern Weekday 19.227 6 4.000
Weekend 19.341 11 4.444
Spanish Weekday 12.130 1 NaN
Weekend 19.618 11 4.833
Thai Weekday 15.463 4 4.000
Weekend 20.207 15 4.750
Vietnamese Weekday 11.917 3 3.000
Weekend 13.608 4 4.667

Observations:

  • In case of American cuisines, the price is increased over the weekend but the orders have increased as well. Similar trend in found in cuisines which are less frequently ordered.
  • In case of Chinese, Japanese & Italian, prices are decreased over the weekend, and significant increase is found in orders as well. Same order can be seen in other cuisine types where prices has been decreased
  • We can say that orders increase generally over the weekend irrespective of cuisine type or change in price.
  • The top for popular cuisine types have same distribution for cost of order.
  • Ratings are also increased for most cuisine types over the weekend irrespective of decrease or increase in price.
In [117]:
df['rating'].mean()
Out[117]:
4.344234079173838
In [22]:
plt.figure(figsize=(5,5))
sns.boxplot(data=df,x='cuisine_type',y='rating')
plt.xticks(rotation=90)
plt.show()
In [10]:
#Lests compare mean prices for cuisine types and mean ratings
df_one=df.groupby(['cuisine_type']).agg({'rating': 'mean', 'cost_of_the_order': 'mean'})
sns.lmplot(data=df_one,x='cost_of_the_order',y='rating',ci=False)
Out[10]:
<seaborn.axisgrid.FacetGrid at 0x7a5318434a30>
In [11]:
df_one.nlargest(10,columns='rating') , df_one.nlargest(10,columns='cost_of_the_order')
Out[11]:
(              rating  cost_of_the_order
 cuisine_type                           
 Spanish        4.833             18.994
 Thai           4.667             19.208
 Indian         4.540             16.920
 Mexican        4.417             16.933
 Japanese       4.374             16.305
 Italian        4.360             16.419
 Chinese        4.338             16.305
 Southern       4.308             19.301
 French         4.300             19.794
 American       4.299             16.320,
                 rating  cost_of_the_order
 cuisine_type                             
 French           4.300             19.794
 Southern         4.308             19.301
 Thai             4.667             19.208
 Spanish          4.833             18.994
 Middle Eastern   4.235             18.821
 Mexican          4.417             16.933
 Indian           4.540             16.920
 Italian          4.360             16.419
 American         4.299             16.320
 Chinese          4.338             16.305)
In [26]:
#lets see if there is any relationship between mean cost of the order and mean rating for the cuisine types
df_one[['cost_of_the_order','rating']].corr()
Out[26]:
cost_of_the_order rating
cost_of_the_order 1.000 0.626
rating 0.626 1.000
In [165]:
#lets check the unrated orders for each cuisine type
df.groupby('cuisine_type',as_index=False)['rating'].apply(lambda x: x.isnull().sum())
Out[165]:
cuisine_type rating
0 American 216
1 Chinese 82
2 French 8
3 Indian 23
4 Italian 126
5 Japanese 197
6 Korean 4
7 Mediterranean 14
8 Mexican 29
9 Middle Eastern 15
10 Southern 4
11 Spanish 6
12 Thai 10
13 Vietnamese 2
In [44]:
#lets see the distribution of ratings on day of the week
sns.countplot(data=df,x='rating',hue='day_of_the_week')
plt.show()
In [29]:
#lets check the unrated orders for day fo the week
df.groupby('day_of_the_week',as_index=False)['rating'].apply(lambda x: x.isnull().sum())
Out[29]:
day_of_the_week rating
0 Weekday 207
1 Weekend 529
In [15]:
df.groupby(['day_of_the_week'])['order_id'].count()
Out[15]:
day_of_the_week
Weekday     547
Weekend    1351
Name: order_id, dtype: int64
In [41]:
#lets check the unrated orders for each cuisine type
df.groupby(['day_of_the_week'])['rating'].mean()
Out[41]:
day_of_the_week
Weekday   4.309
Weekend   4.359
Name: rating, dtype: float64
In [50]:
df.columns
Out[50]:
Index(['order_id', 'customer_id', 'restaurant_name', 'cuisine_type',
       'cost_of_the_order', 'day_of_the_week', 'rating',
       'food_preparation_time', 'delivery_time', 'price_bin'],
      dtype='object')
In [54]:
df.groupby(['price_bin']).agg({'rating':'mean','order_id':'nunique'})
Out[54]:
rating order_id
price_bin
Low 4.312 999
Medium 4.402 585
High 4.338 304

Observations:¶

  • The ratings for each cuisine type is almost similar in terms of distribution of ratings, except for two less popular cuisine types.
  • Ratings follow similar trend over weekdays and weekends in terms of distribution however the average ratings on weekends is higher than weekdays
  • No of unrated orders are proportionate to number of total orders, i.e. American cuisine has most orders so it also has most unrated orders.
  • Average rating for low price and high price are lower than the mean, however medium price orders have better ratings
  • Similar percentage of orders are rated for both weekends and weekdays

Question 13: The company wants to provide a promotional offer in the advertisement of the restaurants. The condition to get the offer is that the restaurants must have a rating count of more than 50 and the average rating should be greater than 4. Find the restaurants fulfilling the criteria to get the promotional offer¶

In [195]:
promo=pd.Series(df.groupby('restaurant_name').filter(lambda group: group['rating'].mean() > 4 and group['rating'].count() > 50)['restaurant_name'].unique())
promo
Out[195]:
0    Blue Ribbon Fried Chicken
1            The Meatball Shop
2                  Shake Shack
3            Blue Ribbon Sushi
dtype: object

Observations:¶

  • Only four restaurants fulfil the criteria: Blue Ribbon Fried Chicken, The Meatball Shop, Shake Shack and Blue Ribbon Sushi.

Question 14: The company charges the restaurant 25% on the orders having cost greater than 20 dollars and 15% on the orders having cost greater than 5 dollars. Find the net revenue generated by the company across all orders¶

In [100]:
# orders above 20 dollars revenue
rev_a=np.round(df.loc[df['cost_of_the_order']>20,'cost_of_the_order'].sum()*.25,2)
orders_a=df.loc[df['cost_of_the_order']>20,'order_id'].nunique()
rev_a,orders_a
Out[100]:
(3688.73, 555)
In [101]:
#orders above 5 dollars
rev_b=np.round(df.loc[(df['cost_of_the_order']<=20) & (df['cost_of_the_order']>5),'cost_of_the_order'].sum()*.15,2)
orders_b=df.loc[(df['cost_of_the_order']<=20) & (df['cost_of_the_order']>5),'order_id'].nunique()
rev_b, orders_b
Out[101]:
(2477.58, 1334)
In [90]:
#total revenue
total_rev=round(rev_a+rev_b,2)
total_rev
Out[90]:
6166.31

Observations:¶

  • Company earns a higher revenue on orders above 25 dollars although the quantity of those orders is lower than that of orders above 5 dollars

Question 15: The company wants to analyze the total time required to deliver the food. What percentage of orders take more than 60 minutes to get delivered from the time the order is placed? (The food has to be prepared and then delivered)¶

In [103]:
# we can create a column named 'total_time'
df['total_time']=df['delivery_time']+df['food_preparation_time']
df.head(2)
Out[103]:
order_id customer_id restaurant_name cuisine_type cost_of_the_order day_of_the_week rating food_preparation_time delivery_time price_bin total_time
0 1477147 337525 Hangawi Korean 30.750 Weekend NaN 25 20 High 45
1 1477685 358141 Blue Ribbon Sushi Izakaya Japanese 12.080 Weekend NaN 25 23 Low 48
In [107]:
#lest calculate how many orders take more than 60 mins
df.loc[df['total_time']>60,'total_time'].shape[0] , df.loc[df['total_time']>60,'total_time'].shape[0]/df.shape[0]
Out[107]:
(200, 0.1053740779768177)
In [109]:
sns.lineplot(data=df, x='total_time', y='rating', ci=False)
<ipython-input-109-7ea2c51fb74d>:1: FutureWarning: 

The `ci` parameter is deprecated. Use `errorbar=('ci', False)` for the same effect.

  sns.lineplot(data=df, x='total_time', y='rating', ci=False)
Out[109]:
<Axes: xlabel='total_time', ylabel='rating'>

Observations:¶

  • 10% of the orders take more than 60 mins to reach the customer after placement of order, and there are 200 such orders in the dataset
  • There is no significant impact on rating of these orders

Question 16: The company wants to analyze the delivery time of the orders on weekdays and weekends. How does the mean delivery time vary during weekdays and weekends?¶

In [111]:
# Write the code here
df.groupby(['day_of_the_week'])['delivery_time'].mean()
Out[111]:
day_of_the_week
Weekday   28.340
Weekend   22.470
Name: delivery_time, dtype: float64
In [4]:
import plotly.express as px
In [5]:
fig = px.box(df, x="day_of_the_week", y="delivery_time")
fig.show()
In [8]:
fig = px.bar(df, x="delivery_time", y="rating")
fig.show()

Observations:¶

  • Delivery time over the weekend is around 6 minutes better than that for weekdays
  • Delivery time between 24 - 30 mins are rated frequently by customers. Orders with Delivery time over 30 mins are rated least. Quick deliveries also get enough number of ratings.

Conclusion and Recommendations¶

Question 17: What are your conclusions from the analysis? What recommendations would you like to share to help improve the business? (You can use cuisine type and feedback ratings to drive your business recommendations)¶

Conclusions:¶

The dataset contained 1898 orders for FoodHub which offers access to multiple restaurants in New York for food delivery service. On the basis of analysis done on the dataset, following conclusions can be made:

  • The app is used during the weekend more than weekdays. More customers rate their orders on weekends as compared to weekday orders. The ratio of cuisine types ordered over weekend and weekdays is similar. The number of orders and mean ratings of orders over weekends is higher irrespective of change in price.
  • The quantity of low cost orders (5-15 dollars) is highest, while their rating is low. Medium cost orders (15-25 dollars) have the best rating average.
  • Delivery time over the weekend is better than weekdays.
  • Most popular cuisines have mean rating better than overall mean rating except for American cuisines.
  • Customer Loyalty is an area where there is room for improvement. Quantity of repeating customers is not very high.
  • No of customers ratings is low, although the overall mean rating is better as most orders are rated 5.

Recommendations:¶

  • The Company should revise its commission policy and commission on orders during weekend shall be a bit higher than weekdays. Moreover, commission on orders may be revised to 3 categories: Low Orders (5-15 dollars) should have least commission (5% as before), Medium cost orders (15-25 dollars) orders should have highest commission (20% as before) and low commission (10-12%) should be charged on High cost orders (25+ dollars).
  • To increase customer loyalty, company may introduce discount for customers on their every 3rd or 5th order, or any similar kind of policy may be introduced.
  • Ratings are crucial in analysing business & customer experience. Therefore, bonus or reward points should be provided to customers completing reviews in order to get better data to make impovement in business.
  • Delivery Time may be improved during the weekdays.
  • More data should be obtained for further analysis like profile of customers (DOB, Company etc), time of the order (breakfast, lunch, dinner etc) and delivery address (or locality/ZIP Code).